This documentation shows how to screen scrape the Hindawi APC table.

Requirements

First you will need to install Python. The easiest way to do so is to install Anaconda.


In [16]:
from IPython.display import YouTubeVideo, HTML, Math, Image
# how to install anaconda on mac
YouTubeVideo('6Dv1wNvTPbg')


Out[16]:

Then you can copy every cell here, or you can import this document into your Anaconda folder (this part is not shown in the video; I'll try to find another video)
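If BeautifulSoup, pandas, or the XML and Excel helpers are not already part of your Anaconda installation, they can be installed directly from a notebook cell. This is just a suggested command, assuming pip is available in your environment:

In [ ]:
#install the packages used in this notebook (assumes pip is available)
!pip install beautifulsoup4 pandas lxml openpyxl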


Import all the necessary modules


In [17]:
#These modules will help you retrieve and parse the content of the Hindawi table
from bs4 import BeautifulSoup
import urllib.request  #in Python 3, urlopen lives in urllib.request
import pandas as pd

Retrieving the Hindawi HTML source page


In [18]:
hindawi_apc_url = 'http://www.hindawi.com/apc/'

hindawi_html_page = urllib.request.urlopen(hindawi_apc_url)

#the 'xml' parser requires lxml; BeautifulSoup's built-in 'html.parser' is an alternative
soup = BeautifulSoup(hindawi_html_page, 'xml')
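As an aside, if urllib gives you trouble, the same page can also be fetched with the requests library. This is only a sketch of an alternative, assuming requests is installed; it is not used in the rest of this notebook:

In [ ]:
#alternative download using the requests library (assumes requests is installed)
import requests

hindawi_html_page = requests.get(hindawi_apc_url).text
soup = BeautifulSoup(hindawi_html_page, 'xml')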

In [19]:
HTML('<iframe src=http://www.hindawi.com/apc/ width=900 height=550></iframe>')


Out[19]:

This Hindawi page contains one HTML table that we need to parse. Please look at the W3Schools documentation about 'table' if you're not familiar with HTML.
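Before writing the full loop, you can take a quick look at the header row of that table with BeautifulSoup. This is just an optional check, not part of the original workflow:

In [ ]:
#optional: inspect the header row ('th' cells) of the table
header_row = soup.find('tr')
print([th.text.strip() for th in header_row.find_all('th')])
#should print: ['Journal Title', 'ISSN', 'APC']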


Selecting or targeting what we want in the Hindawi webpage


In [20]:
#Within the table, find all the 'tr' (table rows)
content = soup.find_all('tr')

In [21]:
#the first elements of content look like this:
'''
<tr class="subscription_table_head">
	 <th>Journal Title</th>
	 <th>ISSN</th>
	 <th class="last_th">APC</th>
 </tr>, 
 <tr class="subscription_table_plus">
	 <td>
	   <a href="/journals/aaa/">Abstract and Applied Analysis</a>
	 </td>
	 <td>1687-0409</td>
	 <td class="to_right">$800</td>
 </tr>
 ...
 '''

Because the first tr contains the table header (Journal Title, ISSN, APC), we will start retrieving content from the second tr.


In [22]:
table = []

#start with the second 'tr', skipping the header row
for value in content[1:]:
    #This will find all the 'td' cells within this 'tr'
    value = value.find_all('td')
    
    #index of VALUE: 0                              ,     1      ,   2
    #value ===> 'Abstract and Applied Analysis', '1687-0409', '$800'
    
    apc = value[2].text.strip()
    #Let's remove the '$' sign, if any, and convert the amount to an integer
    if "$" in apc:
        apc = int(apc.split('$')[1])
    #if the APC is 'Free', then let's write 0 instead of Free
    else:
        apc = 0
    table.append([value[0].text.strip(), value[1].text.strip(), apc])
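Before building the DataFrame, it can be reassuring to check how many rows were parsed and what the first entries look like. An optional check, not part of the original notebook:

In [ ]:
#optional sanity check on the parsed rows
print(len(table))
print(table[:3])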

In [23]:
hindawi_apc_table = pd.DataFrame(table, columns=['Journal Title','ISSN','APC'])

Let's display the first 10 rows


In [24]:
hindawi_apc_table.head(10)


Out[24]:
Journal Title ISSN APC
0 Abstract and Applied Analysis 1687-0409 800
1 Active and Passive Electronic Components 1563-5031 600
2 Advances in Acoustics and Vibration 1687-627X 600
3 Advances in Aerospace Engineering 2314-7520 600
4 Advances in Agriculture 2314-7539 600
5 Advances in Anatomy 2314-7547 600
6 Advances in Andrology 2314-8446 600
7 Advances in Anesthesiology 2314-7555 600
8 Advances in Artificial Intelligence 1687-7489 0
9 Advances in Artificial Neural Systems 1687-7608 600
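Once the data is in a DataFrame, pandas also makes it easy to summarise the APC column, for example with describe() or by counting the journals with no charge. These are just illustrative extra steps, not part of the original workflow:

In [ ]:
#summary statistics of the APC column and a count of journals with no APC
print(hindawi_apc_table['APC'].describe())
print((hindawi_apc_table['APC'] == 0).sum(), 'journals have an APC of 0')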

Export the table to an Excel file


In [25]:
# Export to Excel
hindawi_apc_table.to_excel('Hindawi_apc_table.xlsx', sheet_name = 'Hindawi_APC_Table', index = False)
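Note that to_excel requires an Excel writer such as openpyxl. If you prefer a plain-text format, the same table can also be written to a CSV file; this is an alternative, not part of the original notebook:

In [ ]:
# Export to CSV
hindawi_apc_table.to_csv('Hindawi_apc_table.csv', index=False)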

In summary, to scrape the table you will need these steps (a condensed sketch follows the list):

  1. Import all the necessary modules (BeautifulSoup and pandas)
  2. Retrieve and parse the HTML page: soup = BeautifulSoup(hindawi_html_page, 'xml')
  3. Target the table rows
  4. Loop through these rows to retrieve what you are looking for
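Putting it all together, a minimal end-to-end sketch of these four steps (the same logic as above, condensed into a single cell) could look like this:

In [ ]:
#minimal end-to-end sketch: download, parse, loop, export
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

url = 'http://www.hindawi.com/apc/'
soup = BeautifulSoup(urllib.request.urlopen(url), 'xml')

rows = []
#skip the header row, then read the three 'td' cells of every row
for tr in soup.find_all('tr')[1:]:
    cells = tr.find_all('td')
    apc = cells[2].text.strip()
    apc = int(apc.split('$')[1]) if '$' in apc else 0
    rows.append([cells[0].text.strip(), cells[1].text.strip(), apc])

hindawi_apc_table = pd.DataFrame(rows, columns=['Journal Title', 'ISSN', 'APC'])
hindawi_apc_table.to_excel('Hindawi_apc_table.xlsx', sheet_name='Hindawi_APC_Table', index=False)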